Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures
Authors
Abstract
Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the squared Euclidean distance, the Mahalanobis distance, the KL-divergence and the Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and Exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.
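Coreset constructions of this kind are typically based on importance (sensitivity) sampling: points are drawn with probability proportional to an upper bound on their sensitivity and reweighted so that the weighted cost on the sample is an unbiased estimate of the full cost. The following is a minimal sketch of that generic recipe; the sensitivity_proxy shown here (squared Euclidean distance to a rough solution B mixed with a uniform term) is only an illustrative stand-in, not the bound derived in the paper.

import numpy as np

def coreset_importance_sampling(X, q, m, seed=0):
    # X: (n, d) data, q: (n,) sampling distribution summing to 1, m: coreset size.
    # Returns sampled points and inverse-probability weights so that the
    # weighted cost on the sample is an unbiased estimate of the full cost.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])

def sensitivity_proxy(X, B):
    # Illustrative (hypothetical) sensitivity upper bound: squared Euclidean
    # distance to a rough solution B, mixed with a uniform term.
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1).min(axis=1)
    q = 0.5 * d2 / d2.sum() + 0.5 / len(X)
    return q / q.sum()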
Similar resources
Clustering with Bregman Divergences
A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clust...
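A useful property behind Bregman hard clustering is that, for every Bregman divergence, the arithmetic mean minimises the total divergence of a cluster to its centroid, so only the assignment step depends on the chosen distortion. Below is a minimal Lloyd-style sketch under that assumption; bregman_kmeans, d_phi and kl_divergence are illustrative names, not the cited paper's API, and the generalised KL divergence is just one example distortion.

import numpy as np

def bregman_kmeans(X, k, d_phi, n_iter=50, seed=0):
    # Lloyd-style hard clustering for an arbitrary Bregman divergence.
    # d_phi(X, C) must return the (n, k) matrix of divergences D_phi(x, c).
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = d_phi(X, C).argmin(axis=1)          # assignment step
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)   # mean minimises the Bregman loss
    return C, labels

def kl_divergence(X, C, eps=1e-12):
    # Generalised KL divergence (relative entropy) for nonnegative data.
    P, Q = X[:, None, :] + eps, C[None, :, :] + eps
    return (P * np.log(P / Q) - P + Q).sum(-1)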
Simplification and hierarchical representations of mixtures of exponential families
A mixture model in statistics is a powerful framework commonly used to estimate the probability measure function of a random variable. Most algorithms handling mixture models were originally specifically designed for processing mixtures of Gaussians. However, other distributions such as Poisson, multinomial, Gamma/Beta have gained interest in signal processing in the past decades. These common ...
Parameter Estimation in Finite Mixture Models by Regularized Optimal Transport: A Unified Framework for Hard and Soft Clustering
In this short paper, we formulate parameter estimation for finite mixture models in the context of discrete optimal transportation with convex regularization. The proposed framework unifies hard and soft clustering methods for general mixture models. It also generalizes the celebrated k-means and expectation-maximization algorithms in relation to associated Bregman divergences when applied to e...
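One way to read the hard/soft unification is through a regularisation parameter on the assignment step: with entropic regularisation the responsibilities are a softmax over cluster costs, and as the regularisation vanishes they collapse to the hard arg-min assignment of k-means. The sketch below illustrates that idea only; eps and the function name are assumptions, not the exact formulation of the cited paper.

import numpy as np

def soft_assignments(D, eps):
    # D: (n, k) matrix of Bregman divergences to the k current centroids.
    # eps > 0 is the entropic regularisation strength: large eps gives
    # smooth, EM-style responsibilities; eps -> 0 recovers the hard
    # arg-min assignment used by k-means.
    logits = -D / eps
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    R = np.exp(logits)
    return R / R.sum(axis=1, keepdims=True)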
Scalable and Distributed Clustering via Lightweight Coresets
Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight cor...
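For reference, the lightweight-coreset construction is a particularly simple sampler: each point is drawn with a probability that mixes a uniform term with its squared distance to the data mean, and sampled points are reweighted by the inverse sampling probability. A minimal sketch under that description (function name and defaults are illustrative):

import numpy as np

def lightweight_coreset(X, m, seed=0):
    # q(x) = 1/2 * 1/n + 1/2 * ||x - mean||^2 / sum_x' ||x' - mean||^2,
    # with each sampled point weighted by 1 / (m * q(x)).
    rng = np.random.default_rng(seed)
    n = len(X)
    d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * d2 / (d2.sum() + 1e-12)
    q = q / q.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])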
Algorithms for the Bregman k-Median problem
In this thesis, we study the k-median problem with respect to a dissimilarity measure D_φ from the family of Bregman divergences: Given a finite set P of size n from R^d, our goal is to find a set C of size k such that the sum of errors cost(P, C) = ∑_{p∈P} min_{c∈C} D_φ(p, c) is minimized. This problem plays an important role in applications from many different areas of computer science, such as info...
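A direct sketch of the objective above, with the Itakura-Saito divergence as one concrete choice of D_φ for positive data (function names are illustrative):

import numpy as np

def bregman_kmedian_cost(P, C, d_phi):
    # cost(P, C) = sum over p in P of min over c in C of D_phi(p, c),
    # where d_phi(P, C) returns the (|P|, |C|) divergence matrix.
    return d_phi(P, C).min(axis=1).sum()

def itakura_saito(P, C, eps=1e-12):
    # Itakura-Saito divergence: sum_i (p_i / c_i - log(p_i / c_i) - 1).
    R = (P[:, None, :] + eps) / (C[None, :, :] + eps)
    return (R - np.log(R) - 1.0).sum(-1)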
Publication date: 2016